Integrated system for high throughput capture of genetic diversity

ABSTRACT

Compositions and methods for rapid and highly efficient characterization of genetic diversity in organisms are provided. The methods involve rapid sequencing and characterization of extrachromosomal DNA, particularly plasmids, to identify and isolate useful nucleotide sequences. The method targets plasmid DNA and avoids repeated cloning and sequencing of the host chromosome, thus allowing one to focus on the genetic elements carrying maximum genetic diversity. The method involves generating a library of extrachromosomal DNA clones, sequencing a portion of the clones, comparing the sequences against a database of existing DNA sequences, using an algorithm to select said novel nucleotide sequence based on the presence or absence of said portion in a database, and identification of at least one novel nucleotide sequence. The DNA sequence can also be translated in all six frames and the resulting amino acid sequences can be compared against a database of protein sequences. The integrated approach provides a rapid and efficient method to identify and isolate useful genes. Organisms of particular interest include, but are not limited to, bacteria, fungi, algae, and the like. Compositions comprise a mini-cosmid vector comprising a stuffer fragment and at least one cos site.

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application Serial No. 60/363,388, filed Mar. 11, 2002, the contents of which are herein incorporated by reference in their entirety.

FIELD OF THE INVENTION

[0002] Methods to capture biological diversity in the form of genes encoding novel enzymes and proteins of commercial value are provided. Additionally, novel methods to rapidly sample and screen bacterial genomes for novel genes of interest are described.

BACKGROUND OF THE INVENTION

[0003] Increasingly, bacterial genes are being used in various industrial and agricultural applications such as insect resistant crops, herbicide tolerant crops, or improved industrial processes. Bacteria are capable of carrying out virtually every known biochemical process and are therefore a good source of proteins and enzymes for use in a wide variety of commercial processes. Bacterial genes of utility include those that encode proteins with insecticidal activity, those that catalyze industrial processes, and those that encode proteins responsible for antibiotic resistance and virulence factors. While use of biologically derived genes and proteins is increasing, it remains a cumbersome process to discover and characterize genes encoding proteins which are viable for commercial application. Traditional approaches to identify commercially viable genes and proteins have relied on following the function of interest. Newer genomics approaches have attempted to sequence genes as quickly as possible and identify their function by homology to known genes. It remains unclear how efficient it is to sequence entire genomes of a given organism to identify new genetic activities. Efforts to characterize the genomes of organisms have been ongoing since tools of molecular biology became available for this purpose. These studies often look at the relatedness of different species or at the degree of difference between two or more organisms. There have been no systematic efforts to characterize the specific genes carried by plasmids, small discrete genetic elements of bacteria, and to use such characterization as a means to rapidly identify bacterial genes with commercial applications.

[0004] Bacterial species often carry genetic elements called plasmids that include a variety of genes. Often these plasmid encoded genes give the strain of a given bacterium commercially important characteristics. For instance, many Bacillus thuringiensis (Bt) strains are used as microbial pesticides. The genes responsible for producing the insecticidal proteins of these strains are plasmid encoded. Bt strain HD-1 has been used for decades as a microbial spray against various lepidopteran pests. Since many genes of commercial utility reside on the plasmids, not within the chromosomal DNA, whole-genome based genomics approaches to discover new genes are inefficient because one repeatedly sequences the chromosomal DNA. A number of techniques have been developed to increase the efficiency of gene discovery.

[0005] The use of microarrays allows comparison of several species (the test strains) to a known, sequenced species (the reference strain). In order to perform this method, one must generate the entire DNA sequence of a genome (the reference genome), then synthesize oligonucleotides corresponding to much of the reference genome, and embed these oligonucleotides onto a matrix, such as a chip. One drawback of this method is that one must have the DNA sequence of a closely related reference strain. Only regions of similarity are identified while regions of non-similarity must be inferred. Furthermore, this method does not provide a way to determine nucleotide sequences of the variant regions present in the test strain.

[0006] Polymorphism mapping involves digestion of the genome with rare restriction enzymes and separation of the resulting fragments on pulsed field (PFGE) or field inversion gels (FIGE). This method can be used to screen related strains to determine the relative level of relatedness, and to map regions that are dissimilar between strains. However, this method does not generate any sequence information about the novel regions present in strains.

[0007] Differential hybridization techniques have the ability to identify regions of difference between strains, and to identify clones likely to contain differences. However, differential hybridization techniques are well known for their technical difficulty. The presence of repetitive DNA elements in genomes can substantially interfere with this method. While differential hybridization techniques based on hybridization of bulk PCR reactions are somewhat more technically feasible, none of these techniques has been used for rapid testing and characterization of plasmid sequences.

[0008] Because of the enormous genetic diversity among bacterial plasmids, methods are needed to facilitate the rapid and efficient identification of useful nucleotide sequences. There is a need to identify more bacterial genes with commercial relevance for such applications and to do so rapidly and efficiently.

SUMMARY OF INVENTION

[0009] Methods for rapid and highly efficient characterization of genetic diversity in organisms are provided. The methods involve rapid sequencing and characterization of extrachromosomal DNA, particularly plasmids, to identify and isolate useful nucleotide sequences. The method targets plasmid DNA and avoids repeated cloning and sequencing of the host chromosome, thus allowing one to focus on the genetic elements carrying maximum genetic diversity. The method involves generating a library of extrachromosomal DNA clones, sequencing a portion of the clones, comparing the sequences against a database of existing DNA sequences, using an algorithm to select said novel nucleotide sequence based on the presence or absence of said portion in a database, and identification of at least one novel nucleotide sequence. The DNA sequence can also be translated in all six frames and the resulting amino acid sequences can be compared against a database of protein sequences. The integrated approach provides a rapid and efficient method to identify and isolate useful genes. Organisms of particular interest include, but are not limited to, bacteria, fungi, algae, and the like.

[0010] The sampling methods above can be used to rapidly identify and clone novel genes that have homology to existing genes. Novel genes are identified by this method. These novel genes would be difficult if not impossible to identify by other methods, such as hybridization. Included in this invention are methods to identify novel delta-endotoxin genes, novel cellulase genes, and the like. The sampling methods above can also be used to identify novel genes that have little homology to existing genes.

[0011] Compositions comprise a mini-cosmid vector comprising a stuffer fragment and at least one cos site. This vector is useful for generating a library of DNA clones with reduced insert sizes relative to conventional cosmid or fosmid vectors. This reduced insert size is useful for generating libraries of extrachromosomal DNA which may range in size from 0-200 kb or more.

DESCRIPTION OF FIGURES

[0012] FIG. 1 provides a diagram of Phase I of an improved sequence capture strategy.

[0013] FIG. 2 provides a diagram of two methods for Phase II of an improved sequence capture strategy.

[0014] FIG. 3 provides an example of a sequence capture strategy to isolate novel clones.

[0015] FIG. 4 shows a graphical map of minicos-I. Genes for ampicillin resistance (amp) and kanamycin/neomycin resistance (kan) are shown, as well as the location of the multiple cloning site (MCS), cos sites (cos), cre recombinase recognition sites (lox), and the origin of replication (pUC origin). Lox sites are organized such that incubation with cre recombinase yields two circular molecules, one of which contains the insert DNA, amp resistance, and origin of replication, but lacks the stuffer fragment, kan resistance, and sacB gene.

DETAILED DESCRIPTION

[0016] The invention describes a method to rapidly characterize the genetic diversity in microorganisms and identify genes and nucleotide sequences of commercial interest, without the need for sequencing the entire genome. This method involves a unique coupling of several techniques to create an integrated strategy: generation of libraries with inserts of specific sizes, sampling of sequences, use of algorithms to pick clones most likely to have novel sequences, followed by methods for efficient sequencing of novel clones. The method provides for the rapid sampling of genetic diversity and permits identification of genes and nucleic acid molecules that may not be identified by hybridization or other available methods. Use of the method provides for rapid discovery of new genes and proteins.

[0017] Rapid methods for identifying novel nucleotide sequences from extrachromosomal DNA in a host organism are provided. While the methods are described generally in terms of characterizing bacterial extrachromosomal DNA, the method is applicable to any host organism as well as to direct isolation of DNA from environmental sources such as soil, water, and the like. Direct isolation removes the necessity of culturing the organism or strain prior to isolation of DNA. Host organisms from which the libraries may be prepared include prokaryotic microorganisms, such as Eubacteria and Archaebacteria, lower eukaryotic microorganisms such as fungi, some algae and protozoa, as well as mixed populations of plants, plant spores and pollen.

[0018] The method involves an integrated strategy for isolation and identification of novel nucleotide sequences. By “novel nucleotide sequences” is intended nucleotide sequences that share less than about 30% homology, preferably less than about 60% homology, more preferably less than about 80% homology, most preferably less than about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% homology to any sequence in the database used for comparison.

[0019] This method can be described as having two phases, Phase I and Phase II. In Phase I of the method, databases of plasmid sequences are generated by sequencing random clones of plasmid DNA. A schematic of the steps involved in Phase I is shown in FIG. 1. In Phase II, clones identified in Phase I are heavily sampled by sequencing to capture the sequence diversity present in these clones.

[0020] In one embodiment, the following steps are used to generate a database.

[0021] In Step 1, the DNA is prepared and enriched for extrachromosomal DNA. By “extrachromosomal DNA” is intended plasmids, extrachromosomal phage, linear plasmids, and any other extrachromosomal elements. In this step, DNA (from isolated bacteria, mixtures of bacteria, primary cultures of bacteria and other organisms such as fungi, or even DNA from environmental samples) is prepared by one of several methods known in the art to enrich for plasmid DNA (see Sambrook and Russell, Eds. (2001) Molecular Cloning: A Laboratory Manual (Laboratory Press, New York)). In one embodiment, DNA from individual organisms is released by cell lysis and plasmid DNA is partially purified by methods including gel electrophoresis, pulsed field electrophoresis (PFGE)/field inversion gel electrophoresis (FIGE) (see, for example, Wang and Lai (1995) Electrophoresis 16:1-7), cesium chloride gradient centrifugation, alkaline lysis, purification of plasmid DNA by adhesion to and elution from a DNA binding column, or other methods known in the art to isolate DNA, specifically to isolate plasmid DNA from chromosomal DNA. The DNA can also be treated with DNA exonucleases that preferentially degrade open circular or linear DNA, but do not degrade closed circular DNA. DNA of a particular plasmid may also be purified by methods known in the art, such as gel electrophoresis followed by excision of agarose fragments and purification of DNA from the gel slice (see Sambrook and Russell, supra).

[0022] In Step 2, the resulting DNA (referred to here as plasmid DNA) is then fragmented. It is important to note that the size of the fragmentation is such that large plasmids are cleaved, and represented as a heterogeneous population of different size molecules when analyzed by agarose gel electrophoresis. Small plasmids (less than the average size of the fractionated DNA purified) may or may not be incorporated in the resulting population of different size DNA molecules. Methods to fragment this DNA include sonication, partial digestion with DNAse, shearing by viscosity (e.g. passage through a nebulizer), and partial digestion with a restriction endonuclease (e.g. Sau3AI). It is important to determine a fractionation protocol that yields the correct size DNA fragments.

[0023] In one aspect, the ideal fragment size should be between about 10 to about 20 kb, and more preferably about 15 kb. This size is smaller than sizes typically used for genomic libraries. Using smaller DNA (e.g. 15-20 kb vs. 35-40 kb for typical cosmid libraries) has several advantages. First, shearing of DNA to smaller sizes will result in better representation of plasmid sequences than generation of libraries using larger fragments. This is because, unlike genomic DNA, plasmids are heterogeneous in size, and as a whole substantially smaller than circular bacterial genomes. It is important to allow plasmids of about 50 to 100 kb or larger to be represented efficiently in the resulting libraries. Second, generation of smaller fragments allows one to utilize DNA of lower quality than is required to generate large insert libraries. It is well known in the art that generation of large DNA inserts (e.g. for cosmid libraries) requires careful preparation of DNA to avoid randomly shearing DNA to a size smaller than optimum for generating such libraries (150 kb or more; see Sambrook, supra). This can prove to be quite technically difficult, especially in preparing DNA from bacterial or eukaryotic cells that are hard to lyse, or for cells that produce large amounts of endonucleases. Thus, the smaller size inserts required for the methods of the invention relative to methods requiring very large molecular weight DNA facilitate library generation. Furthermore, coupling of DNA preparation with cloning in specialized vectors (e.g. mini-cosmid vectors as described below) allows for generation of libraries without the need to gel-purify DNA, further improving the throughput of library generation.

[0024] In another aspect of this invention, the DNA resulting from Step 1 is fractionated to yield smaller insert sizes of about 1 to about 5 kb. Purification of fragments of this size may be preferable in some instances. For example, this might be the preferred method when one expects the majority of episomal sequence obtained to be novel, or when one wishes to capture the entirety of the episomal sequences and not exclude relatively small plasmids. In this aspect of the invention, the fragment size should be between about 1 to about 5 kb, and more preferably about 1.5 kb.

[0025] In some instances it may be advisable to generate DNA fragments smaller than 10-20 kb (e.g. 5 kb). One can achieve this by using the methods illustrated in Step 2, modifying the fragmentation conditions to yield smaller fragments. Since the DNA fragments isolated are smaller, this method will require a larger sequencing effort than a method generating a 15 kb insert, since a higher percentage of the clone is sequenced in step 4, and more clones must be analyzed to assure coverage of the diversity in any one strain. In this modified method, it may be preferable to clone the DNA fragments directly into a plasmid vector such as pBluescript (Stratagene) or a cDNA cloning vector such as λZap® (Stratagene).

[0026] After fractionation, the resulting DNA molecules are separated (typically by electrophoresis through agarose gels) and the appropriate size fragments purified. Methods to purify fragments are well known in the art. Examples of purification methods include treatment of gel slices with agarase (β-agarase), or chemical digestion of agarose followed by chromatography.

[0027] In Step 3, the fragments prepared in step 2 are ligated to a vector, and the resulting molecules transformed into bacteria such as E. coli. DNA ligation reactions are performed by methods known in the art (e.g., Sambrook and Russell, supra), usually by incubating a quantity of fragmented bacterial plasmid genome with a quantity of E. coli cloning vector (e.g. pBluescript from Stratagene) in the presence of T4 DNA ligase at 16° C. for 18 hours, or according to the manufacturer's directions. Alternatively, DNA may be ligated for at least 2 hours at about 25° C. Ligated DNA is transformed into a bacterial host by either electroporation or chemical transformation methods known in the art (see, for example, Sambrook and Russell, supra). Resulting colonies are picked, grown in liquid culture, and plasmid DNA prepared by methods known in the art.

[0028] In one aspect of this invention, the vector used is a common cloning vector, such as pBluescript (Stratagene). In other aspects of the invention, the cloning vector is specially designed to allow facile cloning of plasmid sequences (see “Mini-Cosmid Vectors”, below). Ideally the vector will allow library fragments greater than about 5 kb, preferably up to about 25 kb. The vector used in the invention may be plasmid, phage, cosmid, phagemid, virus or selected portions thereof.

[0029] In Step 4, DNA is prepared from individual clones from the library, and a portion of the clone is sequenced. By “portion” is intended less than about 30% of the size of the clone. In general, this is accomplished by sequencing the ends of the insert DNA with primers that anneal to the DNA region adjacent to (but outside) the cloning region of the vector, and prime DNA synthesis into the insert DNA. A sample set of sequences is obtained from each clone. In general this is performed by preparing DNA from each clone, and then performing DNA sequencing reactions using primers that are adjacent to (and prime DNA synthesis in the direction of) the insert DNA fragment. For example, one can prepare 96 well plates containing media and inoculate each well with a colony representing a DNA clone. Multiple 96 well plates can be prepared in this manner. The resulting inoculated wells are grown (usually with shaking at 37° C. overnight) to saturation, and plasmid DNA prepared by methods known in the art (see, for example, Carninci et al. (1997) Nucleic Acids Res. 25(6):1315-1316) or by use of an automated 96-well miniprep kit protocol (QIAprep Turbo, QIAGEN).

[0030] In Step 5, the DNA sequence data resulting from step 4 is compared against a database of existing DNA sequences, including sequences of previous fragmented clones. By “existing DNA sequences” is intended DNA sequences that can be found in a public database, such as Genbank, PFAM, or ProDom. By “database” is intended a collection of data arranged for ease and speed of search and retrieval. The database can comprise nucleic acid sequences and/or deduced amino acid sequences. The databases can be specific for a particular organism or a collection of organisms. For example, there are databases for C. elegans, Arabidopsis sp., M. genitalium, M. jannaschii, E. coli, H. influenzae, S. cerevisiae and others. In preferred embodiments, the database comprises only known endotoxin proteins. In another preferred embodiment, the database comprises only known lignocellulose-degrading enzymes. The database can be a public database, can comprise sequences obtained by end sequencing of various clones, or can be generated from genomic sequences. This comparison is performed with an algorithm designed to parse clones based on presence or absence of their partial sequences in the database. By “algorithm” is intended a recursive, computational procedure for solving a problem in a finite number of steps. By “parse” is intended to separate or sort into parts. Clones not having their partial sequences represented in the database are identified in this manner. These are referred to here as novel clones. The sequences are tested to identify novel sequences, likely to represent novel clones. This can be done by, for example, performing similarity searches against a database of all known sequences. Typically, this is performed by using the BLAST series of algorithms (Altschul et al. (1990) J. Mol. Biol. 215:403-410; Altschul et al. (1997) Nucleic Acids Res. 25:3389-3402; Gish and States (1993) Nature Genet. 3:266-272).
BLAST algorithms compare a query sequence(s) for similarity to a database of known sequences and identify sequences in the database(s) with highest scoring probability of similarity. The results of BLAST searches are typically expressed by a ‘BLAST score’ which is an expression of the probability of the two sequences NOT being truly similar. Thus, low BLAST scores suggest high degrees of similarity. Proteins or DNA regions with identical amino acid or DNA similarity can yield scores of 0, suggesting the probability of the two sequences not being related is zero (since they are identical). High scoring BLAST similarities often have values of e⁻⁵⁰ or greater. Selection of novel sequences can be done by empirical inspection of BLAST scores, and sorting of novel sequences (having no high scoring match in a BLAST search) from those sequences having BLAST scores likely to indicate identity (for example P₀ of e⁻¹⁰ or e⁻²⁵ or e⁻¹⁰⁰ or greater). Alternatively, one can analyze batches of BLAST scores using algorithms designed to parse high scoring reads from low scoring reads. An example of the logic involved in such an algorithm is described in Example 3. The values of the BLAST score cutoffs are intended to be exemplary. One can vary the cutoff values used without substantially reducing the value of the method. One way to determine the values to use for this procedure is by empirically setting values, running tests, and empirically determining the most useful values. Using such methods, one can quickly identify only the clones that have at least one and preferably two unique sequences (i.e. not previously identified in the database). Clones having one or more unique sequences are then sequenced in their entirety. In one embodiment, the nucleotide sequence is translated into all six reading frames to obtain all possible amino acid sequences and then the amino acid sequences are compared to a protein database.
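The parsing logic described above can be illustrated with a minimal sketch. It is not the algorithm of Example 3; it only shows one way to sort clones by whether their end reads have a high scoring database match. The `EndRead` record, the function names, the default cutoff of `1e-10`, and the sample values are all hypothetical, and the best-hit value for each read is assumed to have been extracted from a BLAST report beforehand.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class EndRead:
    """One end-sequence read from a library clone (hypothetical record)."""
    clone_id: str
    read_id: str
    best_evalue: Optional[float]  # best database hit; None means no hit at all


def is_novel(read: EndRead, cutoff: float = 1e-10) -> bool:
    """A read is treated as novel if it has no hit, or no hit at or below the cutoff."""
    return read.best_evalue is None or read.best_evalue > cutoff


def parse_clones(
    reads: Iterable[EndRead],
    cutoff: float = 1e-10,      # exemplary value; tune empirically
    min_novel_reads: int = 1,   # use 2 to require both ends to be unique
) -> Tuple[List[str], List[str]]:
    """Sort clone IDs into (novel, known) by counting novel end reads per clone."""
    novel_counts: Dict[str, int] = {}
    for read in reads:
        novel_counts.setdefault(read.clone_id, 0)
        if is_novel(read, cutoff):
            novel_counts[read.clone_id] += 1
    novel = sorted(c for c, n in novel_counts.items() if n >= min_novel_reads)
    known = sorted(c for c, n in novel_counts.items() if n < min_novel_reads)
    return novel, known


# Made-up example values, for illustration only.
reads = [
    EndRead("clone_A", "A_fwd", 0.0),     # identical to a database entry
    EndRead("clone_A", "A_rev", 1e-80),
    EndRead("clone_B", "B_fwd", None),    # no hit
    EndRead("clone_B", "B_rev", 1e-60),
    EndRead("clone_C", "C_fwd", 0.5),     # only a weak hit
    EndRead("clone_C", "C_rev", None),
]
novel_clones, known_clones = parse_clones(reads)
```

Raising `min_novel_reads` to 2 corresponds to the stricter preference stated above for clones with two unique sequences.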

[0031] One such program is BLASTX (Altschul et al. (1990) J. Mol. Biol. 215:403-410; Gish and States (1993) Nature Genet. 3:266-272). BLASTX searches may be performed against a large set of known genes (for example, the Genbank database). Alternatively, such searches may be performed locally, against smaller databases containing genes of particular interest to the user.
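BLASTX performs six-frame translation internally. The following sketch only illustrates what translating a nucleotide sequence in all six reading frames means. It assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names are hypothetical.

```python
from typing import Dict

# Standard genetic code, built from the conventional TCAG codon ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in _BASES for b in _BASES for c in _BASES), _AMINO
    )
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate codon by codon; a trailing partial codon is ignored,
    and any codon containing an unexpected character becomes 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> Dict[str, str]:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames: Dict[str, str] = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six resulting amino acid sequences could then be compared against a protein database, as described for the embodiment above.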

[0032] While the algorithm may be a computer program, it is not critical to the invention that the algorithm be a computer program, or that if written as a computer program, that it be written in any particular computer language. For example, the algorithm may be as simple as a set of instructions for a person to utilize to identify and sort individual sequences by hand. Alternatively, such steps may be incorporated into a computer program. In one aspect of this invention, the algorithm is represented in a computer program written in C++, Java, or Basic programming language. It is understood that one may create such a program in one of many different programming languages. In one aspect of this invention, this program is written to operate on a computer utilizing a UNIX™ operating system. In this aspect, it is preferable if the computer program is designed to be compatible with DNA sequence assembly and analysis software. For example, Phred, Phrap, and Consed (Ewing et al. (1998) Genome Research 8:175-185; Ewing and Green (1998) Genome Research 8:186-194; Gordon et al. (2001) Genome Research 11(4):614-625) are powerful programs used to sort DNA sequences by quality and assemble overlapping sequence reads. Consed (Gordon et al. (1998) Genome Research 8:195-202) is a program designed to allow editing and analysis of overlapping sequence reads generated by use of Phred and Phrap programs. It may be preferable to design the computer program to accept sequence files resulting from Phred/Phrap, Consed, or other DNA sequence assembly software.

[0033] In one aspect of this invention, one continues to sequence random clones, and does not institute Phase II. In this aspect, one continues to sequence random clones to generate a database of diversity resulting from the extrachromosomal DNA of interest. This may be preferable when the number of unique sequences resulting from Phase I is high, e.g. greater than about 66% of resulting sequences.
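The choice between continuing random sequencing and moving to Phase II can be written as a simple decision rule. This is only a sketch: the function name is hypothetical, and the 0.66 threshold is the exemplary figure from the paragraph above rather than a fixed requirement.

```python
def continue_random_sequencing(
    unique_reads: int,
    total_reads: int,
    threshold: float = 0.66,  # exemplary fraction; adjust as needed
) -> bool:
    """Return True when the fraction of unique Phase I reads is high enough
    that continued random sequencing may be preferable to Phase II."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= unique_reads <= total_reads:
        raise ValueError("unique_reads must be between 0 and total_reads")
    return unique_reads / total_reads > threshold
```

For instance, 70 unique reads out of 100 would favor continued random sequencing under this rule, while 50 out of 100 would not.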

[0034] In Phase II, the DNA sequences contained in the novel clones are obtained. This can be accomplished by one of several methods, which can result in generation of complete sequence of all genes contained in novel clones. Two methods by which this can be accomplished are illustrated in FIG. 2, ‘Method A’, and an alternative method, ‘Method B’.

[0035] Method A involves generation of randomly sheared DNA fragments from novel clones to generate smaller DNA fragments (for example 1-3 kb), cloning of these DNA fragments to generate a library of sub-clones, and sequencing of a number of these sub-clones for each novel clone. A summary of the steps involved in Method A is shown in FIG. 2. For example, a particular 20 kb clone may be digested with restriction enzymes liberating the novel insert DNA. That DNA fragment is purified by gel electrophoresis, and fragmented to small fragments (1-3 kb preferably) by methods known in the art, and described in step 2 above. A number (e.g. 10-50) of the resulting subclones are then picked and their end sequences determined as in step 4 above. The resulting DNA sequences are assembled to generate the sequence of the 20 kb DNA fragment. It is important to note that for the purposes described in this disclosure, it is not necessary to generate complete, unambiguous DNA sequence for all nucleotides (or even a majority of the nucleotides) contained in this fragment.
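To get a rough sense of how much of a novel clone a given number of subclones might cover, one can apply the Lander-Waterman approximation. This model is not part of the method described above; it is a standard idealization that assumes reads land uniformly at random, and it ignores paired-end constraints. The 500-base read length in the usage comment is an assumption, as are the function names.

```python
import math


def sequencing_depth(
    clone_bp: int, subclones: int, reads_per_subclone: int, read_len_bp: int
) -> float:
    """Average depth c = (number of reads x read length) / clone length."""
    if min(clone_bp, read_len_bp) <= 0 or min(subclones, reads_per_subclone) < 0:
        raise ValueError("lengths must be positive and counts non-negative")
    return subclones * reads_per_subclone * read_len_bp / clone_bp


def expected_fraction_covered(
    clone_bp: int, subclones: int, reads_per_subclone: int, read_len_bp: int
) -> float:
    """Lander-Waterman estimate: fraction of bases covered is about 1 - exp(-c)."""
    depth = sequencing_depth(clone_bp, subclones, reads_per_subclone, read_len_bp)
    return 1.0 - math.exp(-depth)


# Illustration: a 20 kb clone, 50 subclones, both ends read,
# assuming reads of about 500 bases (the read length is an assumption).
depth = sequencing_depth(20_000, 50, 2, 500)
covered = expected_fraction_covered(20_000, 50, 2, 500)
```

Under these assumptions the depth is 2.5 and the estimated covered fraction is 1 - e^(-2.5), which is incomplete coverage of the kind the paragraph above notes is acceptable.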

[0036] Method B describes one aspect of the invention. In this aspect, a series of reactions is performed to generate the sequence from the novel clones in a rapid fashion. A summary of the steps involved in Method B is shown in FIG. 2.

[0037] In one aspect of Method B, the clones from step 5 are mutagenized with a transposable element in vitro (e.g. Tn5). The transposon system used inserts a transposable element that contains the DNA for an antibiotic resistance marker not otherwise present on the clones. Methods for mutating clones are well known in the art (see Sambrook, supra).

[0038] In most cases the order of the reactions can be inverted without hindering the outcome of the experiment. If the procedure involves transforming into E. coli, it is advisable to perform this step second.

[0039] Next, the sequences of each novel clone are obtained by preparing purified DNA from several of the Tn-insertion clones (10-50 per novel clone, depending on size of the original clone) and sequencing the insert DNA by priming DNA synthesis from the transposable element. Each random insertion of the transposon will generate a new primer binding site.

[0040] The resulting DNA sequences are compiled and the sequence of the novel clone determined.

[0041] Mini-cosmid Vectors

[0042] In one embodiment, the vector used for generating the library is a “mini-cosmid” vector. These vectors are defined predominately by the insertion of a large stuffer fragment between two COS sites; this allows one to generate “mini-cosmid” libraries. By “stuffer fragment” or “stuffer sequence” is intended a DNA fragment useful to control the size of the cloned insert within a vector. It is recognized that the size may vary to obtain clones of varying lengths. Generally, the stuffer fragment will have characteristics as described below.

[0043] These mini-cosmid libraries are prepared similarly to cosmid libraries, except that the presence of a large stuffer fragment alters the average insert size allowable from about 35 to about 40 kb to a smaller size, for example, about 15 to about 20 kb, or about 20 to about 25 kb of insert. The vectors designed and created for this purpose are referred to herein as mini-cosmid vectors. The size of the stuffer fragment will vary depending upon the preferred size of the insert. Generally, the stuffer fragment will range from about 5 to about 35 kb, including sizes of about 10 to about 30 kb, about 15 to about 25 kb, and about 20 kb.
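The relationship between stuffer size and allowable insert size can be approximated with simple arithmetic. The sketch below assumes that the stuffer displaces insert capacity one-for-one within a fixed packaging window, which is a simplification; the function name is hypothetical, and the default range of 35 to 40 kb is the conventional cosmid insert size given above.

```python
from typing import Tuple


def mini_cosmid_insert_range(
    stuffer_kb: float,
    conventional_insert_kb: Tuple[float, float] = (35.0, 40.0),
) -> Tuple[float, float]:
    """Approximate insert size range (kb) left after adding a stuffer fragment,
    assuming each kb of stuffer removes one kb of insert capacity."""
    low, high = conventional_insert_kb
    if stuffer_kb < 0 or stuffer_kb > low:
        raise ValueError("stuffer size must be between 0 and the minimum insert size")
    return (low - stuffer_kb, high - stuffer_kb)


# A stuffer of about 20 kb gives inserts of about 15 to 20 kb;
# a stuffer of about 15 kb gives inserts of about 20 to 25 kb.
range_20 = mini_cosmid_insert_range(20.0)
range_15 = mini_cosmid_insert_range(15.0)
```

Both results agree with the example insert ranges stated in the paragraph above.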

[0044] These vectors use COS sites to allow size selection of inserts by packaging in phage, and therefore remove the need for gel purification of digested DNA. The stuffer fragment is located between the COS sites of the vector. This unique feature allows one to create libraries with reduced insert sizes relative to conventional cosmid or fosmid vectors. This reduced insert size is useful for generating libraries of bacterial plasmids, which may range in size from 0-200 kb or more (and usually 5-150 kb or more).

[0045] The stuffer fragment can be engineered to have several useful features. In one aspect of this method, the stuffer DNA contains the DNA encoding a functional copy of the Bacillus subtilis sacB gene (for example from the vector pRE112 (Edwards et al. (1998) Gene 207:149-157)). sacB encodes a levansucrase that is toxic to gram-negative bacteria grown in the presence of sucrose; sacB activity leads to the formation of levan polymers that kill the cell. Thus, a stuffer fragment encodingsacB provides a way to select against presence of the plasmid, or more specifically the stuffer fragment, in E. coli.

[0046] Furthermore, the stuffer fragment can be engineered to contain a copy of an antibiotic resistance gene, such as the chloramphenicol acetyltransferase gene. Presence of such a gene can allow one to either select for clones containing this gene, or against constructs containing this gene by replica plating.

[0047] Furthermore, the stuffer fragment can contain an origin of replication that confers ability of the resulting plasmid to replicate in hosts other than E. coli, including, for example, Bacillus and Streptomyces species.

[0048] Furthermore, the boundaries separating the stuffer fragment from surrounding DNA can be designed to have features which allow one to remove the stuffer fragment from the plasmids at a time after packaging and transfection into E. coli. For example, one can engineer the boundaries of the stuffer fragment to have cleavage sites for one or several rare restriction enzymes, such as PmeI, PacI, SfiI, or an intron-encoded nuclease. Thus, digestion with this rare enzyme will excise the stuffer fragment without digesting the insert-containing vector anywhere else (i.e. in the insert DNA). The digested vector can then be religated to create clones that now lack the stuffer fragment. This can be useful in preparing the DNA for subsequent analysis such as transposon mutagenesis.

[0049] Removal of the stuffer fragment may be useful where one wishes to perform methods that would be hindered by the presence of the stuffer, such as transposon mutagenesis. Furthermore, the boundaries of the stuffer fragment can be designed to have sites recognized by site-specific recombinases, including transposases. One example of such a recombinase is the cre recombinase, which catalyzes recombination at specific nucleotide sites (lox sites). It is understood that many of the various known site-specific recombinases will function as a site-specific recombinase system for the stuffer. Such recombinase systems include the cre/lox system, the FLP recombinase system (based on the recombinase of the yeast two micron plasmid), and P1 phage based recombinases (see, for example, Stark et al. (1992) Trends Genet. 8:432-95; Hallet and Sherratt (1997) FEMS Microbiol. Rev. 21:157-78). Thus, one can remove the stuffer fragment by incubation of the vector in the presence of a site-specific recombinase such as cre, either in vitro, or by passaging the vector through a strain expressing or inducible to express the cre recombinase.

[0050] The vectors of this invention provide a number of ways to remove the stuffer fragment from the vector after transfection into E. coli. The resulting plasmid is then transformed, transfected, electroporated, or otherwise transferred into E. coli, and clones having lost the stuffer fragment, but containing a transposon insertion (as judged by resistance to the antibiotic marker carried by the transposon) are identified. This results in the generation of a number of clones for each novel clone, with transposon insertions randomly distributed throughout the circular plasmid.

[0051] Techniques by which removal of the stuffer fragment can be accomplished include but are not limited to:

[0052] 1. Digestion of DNA with a restriction enzyme, such that digestion with this enzyme cleaves at each end of the stuffer fragment.

[0053] 2. Treatment of the DNA in vitro with a trans-acting site-specific recombinase such as the cre recombinase. This method is useful in the case that the vector has lox sites flanking the stuffer DNA, arranged in the proper orientation to excise the stuffer fragment.

[0054] 3. Transformation of the DNA into an E. coli strain that expresses the cre recombinase (for example λKC: Elledge et al. (1991) Proc. Natl. Acad. Sci. USA 88:1731-1735). This method is useful in the aspect in which the vector has lox sites flanking the stuffer DNA, arranged in the proper orientation to excise the stuffer fragment. Clones identified in this strain as transformants are likely to have lost the stuffer fragment by cre-mediated deletion.

[0055] 4. Amplification of the novel DNA insert by PCR with a high fidelity thermostable polymerase (such as Pfu), and cloning the resulting PCR product into a vector that lacks the stuffer fragment and has not been mutagenized with a transposon.
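The recombinase-based techniques (2 and 3) can be pictured with a small Python sketch that treats the vector as a plain string and removes the segment lying between two directly repeated lox sites, leaving one site behind. This is only an illustrative model: the site used is the commonly cited 34 bp loxP sequence, the function name `excise_between_sites` and the toy plasmid are hypothetical, and circularity and inverted site orientation are ignored.

```python
# Commonly cited 34 bp loxP site (13 bp arm, 8 bp spacer, 13 bp arm).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def excise_between_sites(plasmid, site=LOXP):
    """Remove the segment between two direct repeats of `site`.

    One copy of the site remains in the product, as expected for
    recombination between directly repeated sites.
    """
    first = plasmid.find(site)
    if first == -1:
        raise ValueError("no recombination site found")
    second = plasmid.find(site, first + len(site))
    if second == -1:
        raise ValueError("two directly repeated sites are required")
    # Keep everything through the first site, then resume after the second.
    return plasmid[: first + len(site)] + plasmid[second + len(site):]


# Toy example: 'STUFFER' is a placeholder, not a real DNA sequence.
toy_vector = "GGGG" + LOXP + "STUFFER" + LOXP + "CCCC"
print(excise_between_sites(toy_vector) == "GGGG" + LOXP + "CCCC")
```

The same function models technique 1 if `site` is set to a rare restriction site flanking the stuffer, with the caveat that a real digest and religation would not necessarily regenerate an intact site.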

[0056] In principle, one can use essentially any DNA as a stuffer fragment. However, some characteristics of the stuffer DNA provide advantages over other possible choices. First, it is advantageous to use a stuffer fragment that has few restriction sites. Addition of such a large piece of DNA can create problems in identifying unique sites elsewhere in the insert-containing clone. It is also important that the stuffer fragment not contain recognition sites for the critical restriction enzymes of the vector, such as the XbaI site that separates the COS sites, and the restriction sites in the polylinker.

[0057] Second, the stuffer fragment should be known to propagate in E. coli, and to lack origins of replication or large inverted repeats that would interfere with plasmid propagation or cell growth.

[0058] Current vectors for cloning and analysis of DNA from prokaryotic organisms fall into the following classes.

[0059] General plasmid-based cloning vectors, such as pUC118 (Stratagene) and pBS SK+ (Stratagene), are designed for cloning of small DNA inserts, usually one gene. These vectors are quite useful for cloning genes amplified by PCR, and many versions of such plasmids are commercially available from suppliers such as Stratagene, Promega, and Invitrogen. However, the ability of all insert sizes to replicate in these vectors, and the growth advantage of small inserts over larger insert sizes, reduce their usefulness for the cloning of genomes. Cloning of genomic or other complex DNA into these vectors typically requires gel-purification or other size selection of the insert DNA to allow cloning of appropriate size inserts. Furthermore, when using these vectors, one tends to clone relatively small DNA fragments of about 0.5-10 kb, usually no more than 5 kb. The reduced size of genomic inserts increases the number of clones that must be screened to adequately cover the genome.

[0060] Cosmid vectors such as pWE15 allow cloning of fairly large DNA fragments (up to 40 kb) by the use of a COS site to package ligated DNA into lambda. However, the DNA must be carefully prepared to obtain DNA of at least 100 kb, and preferably 150 kb. This is needed to ensure that fragmentation by partial digest yields two ends on each molecule that are digested with the restriction enzyme, and not sheared randomly. This DNA is typically gel-purified after digestion. Vectors such as Supercos (Stratagene) possess two COS sites, and therefore allow one to clone 15-40 kb inserts without gel purification; this is because inserts must be a minimum size to allow them to be packaged by lambda packaging extract. However, since the DNA cloned is so large, one must carefully prepare the DNA as for single COS vectors.

[0061] cDNA cloning vectors such as LambdaZap™ allow cloning of small inserts, up to 10 kb, by use of lambda packaging extracts. Phage can be manipulated, then induced to produce plasmid by induction of single-stranded DNA by superinfection with M13 helper phage, such as R408, followed by transfection into a fresh host strain (Short et al. (1988) Nucleic Acids Research 16:7583-7600).

[0062] Mini-cosmid vectors are useful in the rapid generation of libraries of medium to large insert size. The ability to package the insert DNA after phosphatase treatment, and without size selection, provides a speed and insert size advantage over plasmid-based cloning, and allows library construction with lower quality DNA inserts than is required for cosmid library or BAC library construction. Mini-cosmid vectors allow excision of the insert as a minimal vector, containing an antibiotic resistance gene (e.g. ampicillin resistance) and a ColE1 origin of replication. To facilitate size reduction of the mini-cosmid clones, several features are designed into the vector.

[0063] In one embodiment, the minimal vector is flanked by recombination sites, for example lox or frt sites, organized such that incubation of a full insert-containing mini-cosmid clone with, for example, the Cre recombinase results in excision of the minimal vector. Excised minimal vector can be selected by plating on antibiotic (such as ampicillin) and counter-selecting by plating on sucrose. Thus, only clones that maintain amp resistance and have lost SacB function will grow. One can further confirm the excision by plating amp resistant clones onto kanamycin. Since kanamycin resistance resides outside of the minimal vector, the clones should be ampicillin resistant, sucrose resistant, and kanamycin sensitive.

[0064] As an alternative to use of recombination sites, mini-cosmid vectors contain a series of restriction enzyme sites at the border of the minimal vector. Thus, one can reconstitute the minimal vector by digesting with one or more of these enzymes, diluting the digestion mixture, re-ligating the diluted digestion mixture, and transforming this mixture into a cell. One may then select for formation of the minimal vector as described for recombination sites above.

[0065] Further Methods

[0066] In one aspect of the invention, one may further identify DNA regions surrounding the novel clone. For example, one may accomplish this by generating hybridization probes and screening an existing DNA library (such as the library sequenced in Phase I). Alternatively, one may generate a library of larger inserts (for example a cosmid library), and screen for clones likely to contain DNA adjacent to the novel clone of interest. Alternatively, one may use one of many methods to identify sequences adjacent to clones. For example, one may clone and sequence regions flanking a known DNA by inverse PCR (Sambrook and Russell, supra). Another such method involves ligating linkers of known sequence to genomic DNA digested with restriction enzymes, then generating a PCR product using an oligonucleotide homologous to the linker and an oligonucleotide homologous to the region of interest (e.g. the end sequence of a novel clone). A kit for performing this procedure (GenomeWalker®) is available from Clontech.

[0067] The method described here is useful for generating large datasets containing gene sequences of commercial value. For example, it is well known that insecticidal protein genes, such as the Bacillus thuringiensis delta-endotoxin genes, are found predominantly on large extrachromosomal plasmids. Thus isolation and sequencing of plasmids from Bacillus strains, such as Bacillus thuringiensis strains, is likely to lead to identification of novel delta-endotoxin genes. Such genes are likely to be valuable for controlling insect pests. Furthermore, many Clostridia strains are known to have large extrachromosomal plasmids, and some of these are known to contain virulence factors, as well as toxins such as iota toxin (see, for example, Perelle et al. (1993) Infect. Immun. 61(12):5147-5156, and the references cited therein). Furthermore, it has been shown that the majority of variability among Clostridia strains appears to be due to plasmid content (see, for example, Katayama et al. (1996) Mol. Gen. Genet. 251:720-726). Thus, sequencing of the plasmids of multiple Clostridia strains will quickly capture a large amount of genetic diversity. There has been a report of a homolog of a delta-endotoxin gene present in Clostridia sp. (Barloy et al. J. Bacteriol. 178:3099-3105).

[0068] Tumor-inducing and symbiotic plasmids are common in Agrobacterium and Rhizobium strains (e.g., Van Larebeke et al. (1974) Nature 252:169-170). Thus sequencing of bacterial plasmids, especially those from known plant pathogens, is likely to identify genes involved in plant-pathogen interactions, including genes involved in or required for both virulence and avirulence.

[0069] Much of the diversity present in bacterial populations is present on plasmids. Many plasmids are known to contain virulence factors important for infectivity or severity of infection by bacterial pathogens. Correspondingly, many of the proteins expressed by plasmid genomes are likely to have value as vaccines. For example, both plasmids pXO1 and pXO2 of Bacillus anthracis encode proteins required for pathogenesis during anthrax infection. pXO2 encodes proteins that produce a protective capsule around the bacterium. The pXO1 plasmid encodes the three proteins of the anthrax toxin complex: lethal factor (LF), edema factor (EF), and protective antigen (PA). The PA protein forms the basis of a vaccine for anthrax. The quick and efficient sequencing of bacterial plasmids will yield information with which one can create a database of proteins that might serve as effective vaccines.

[0070] The methods are useful for strain identification and typing. Not only are bacterial plasmids a vital component of the diversity of bacterial subspecies, they contribute substantially to the genetic differences between closely related strains. Plasmids are known to be transferable between related strains, and can result in modified characteristics such as ability to produce toxins, etc. Thus there is commercial value in developing diagnostic tools based on plasmid sequences. DNA sequences generated by this method can be used for generating diagnostic tools. These diagnostic tools can be created by comparing DNA sequences of plasmids obtained by this method, identifying either unique sequence regions, or regions shared by groups of plasmids one wishes to identify. Oligonucleotides corresponding to the identified regions can then be synthesized by methods known in the art, and used to establish PCR-based strain typing methods.
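One simple way to picture the comparison step is a k-mer screen: short subsequences found in exactly one plasmid are candidate strain-specific primer targets, and those found in every plasmid are candidate group-wide targets. The Python sketch below is an assumption-laden illustration rather than the method of the invention; the function names, the choice of k, and the toy sequences are hypothetical, and reverse-complement matches and primer design constraints are ignored.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def diagnostic_kmers(plasmids, k=20):
    """Split k-mers into per-plasmid unique sets and a shared set.

    `plasmids` maps a plasmid name to its sequence. Returns (unique, shared),
    where unique[name] holds k-mers seen only in that plasmid and shared
    holds k-mers present in every plasmid.
    """
    sets = {name: kmers(seq, k) for name, seq in plasmids.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for name, own in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = own - others
    return unique, shared


# Toy sequences, far shorter than real plasmids.
unique, shared = diagnostic_kmers({"p1": "AAACCCGGG", "p2": "AAACCCTTT"}, k=6)
print(sorted(shared))
print(sorted(unique["p1"]))
```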

[0071] In the same manner, the methods can be used in medical diagnostics, that is, for the detection, identification, or typing of infectious agents.

[0072] The present method is useful for analyzing the contribution of extrachromosomal DNA, for example bacterial plasmids, to genome diversity. Ng et al. (Genome Research 8:1131-1141) sequenced the 191 kb ‘dynamic replicon’ from a halophilic archaeon and found the presence of 1,965 ORFs of 15 bp or larger.

[0073] A survey of Clostridium perfringens strains concluded that serological variation as well as changes in pathological spectrum may be entirely due to loss or acquisition of extrachromosomal elements (Canard et al. (1992) Mol. Microbiol. 6:1421-1429). A separate article provides further evidence that a substantial amount of bacterial strain variation is plasmid borne (Katayama et al. (1996) Mol. Gen. Genet. 251:720-726).

[0074] Strain variation due to plasmid content is also well known for Bacillus strains, particularly Bacillus thuringiensis. See, for example, Carlton and Gonzales (1984) “Plasmid-Associated Endotoxin Production in Bacillus thuringiensis” in Genetics and Biotechnology of Bacilli, eds. Ganesan and Hoch (Academic Press).

[0075] Isolation and sequencing of plasmid DNA specific to bacteria has several advantages over current methods for gene identification. First, since genes are identified by DNA sequence, this method is more likely to identify genes with lower DNA similarity to known genes than can readily be accomplished by hybridization. Second, since the plasmid genomes of strains will be a fraction of the total genome size (1-20%), it will be possible to rapidly sample the genomes of many related bacteria, and quickly identify interesting genes. Third, since much of strain to strain variation exists due to plasmid differences, this method will be very efficient at capturing the major diversity differences in bacterial groups. Furthermore, the efficiency of the method increases as the size of the existing sequence data set increases (see Table 3). As the percent of novel clones detected drops from 50% to 1%, the efficiency of the method increases from 3-fold to 16-fold relative to sequencing the bacterial genome (for a 15 kb insert size, see Table 3).

[0076] Though only specific bacterial species are described herein, it is understood that virtually all bacteria are likely to contain plasmid or episomal DNA, and that plasmid DNA can be selected from these bacteria and utilized in the method to identify novel genes. Furthermore, it is understood that one need not necessarily have purified the bacteria or other cell in order to isolate and analyze its plasmid content; i.e. this method can be applied to samples from mixed populations, or of unknown origin, such as environmental samples.

[0077] While many of the commercial uses of the resulting sequences can be apparent from direct inspection of the resulting sequences, one may perform additional steps to identify further commercial uses of the resulting sequences or genes.

[0078] First, the sequences are compared by DNA and amino acid homology to public and private gene databases, to identify any genes that are likely to be homologous to known commercial proteins. This method can also identify characteristics of otherwise novel proteins, for example, presence of common functional protein motifs such as ATP-binding domains, transmembrane regions, etc.
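Amino acid comparison of an uncharacterized DNA read generally starts from a translation in all six reading frames. The following Python sketch shows one way this could be done with the standard genetic code; the function names are hypothetical, codons containing ambiguous bases are reported as 'X', and no database search is performed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; '*' marks stops, 'X' unknown codons."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frame("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six resulting protein strings would then be submitted to whatever protein similarity search is in use.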

[0079] Second, the genes can be cloned into expression vectors in order to produce proteins. For example, one may amplify the genes by PCR and clone them into an expression vector such as pGEX (New England Biolabs), transform this construct into E. coli, and express protein by methods known in the art (see, for example, Sambrook and Russell, supra). One may perform this step for all genes identified, or a subset based on results of homology searches, or other criteria.

[0080] The proteins can then be tested either before or after purification for functions of commercial interest. Such functions could include but are not limited to (1) insecticidal activity; (2) ability to degrade enzyme substrates such as cellulose, hemicellulose, lignin, keratin, starch, etc.; (3) ability to stimulate cell proliferation; (4) ability to stimulate or suppress immune response, or stimulate or repress activity of proteins involved in immune response; (5) ability to confer immunity against challenge by foreign protein or cell; (6) ability to induce or prevent cell death such as that created by apoptotic responses; (7) ability to inhibit microbial and fungal growth, in particular discovery of novel antibiotic genes.

[0081] Notwithstanding the uses described herein, this method will result in the identification of novel genes, and these novel genes will be identified and found to be of use. Enzymes are one type of useful product likely to be found by this method. Accordingly, genes that encode enzymes will be identified. Such genes include those encoding enzymes belonging to the family of oxidoreductases, transferases, hydrolases, lyases, isomerases or ligases, as well as lignocellulose-degrading enzymes. Additionally, other proteins such as insecticidal proteins, cry proteins, virulence factors, avirulence factors, binding proteins, structural proteins, and receptor proteins will also be discovered by this method.

[0082] The method of the invention is more rapid and efficient than methods currently available. Current methods for discovering genes fall into two classes: functional methods and genomic methods. Functional methods attempt to identify genes by virtue of the activity of the gene product, either as naturally expressed from an organism, or after cloning and expression in a heterologous host, such as E. coli or yeast. Examples of functional methods include cloning of cDNAs into expression vectors followed by assay for function, and identification of interacting proteins by a two-hybrid screen.

[0083] Genomic methods strive to identify novel genes by (a) identifying genes with significant DNA similarity to known genes of interest, e.g. by hybridization or by oligo capture techniques; (b) identifying genes expressed at a time or in a tissue of interest, for example in amyloid plaques generated in Alzheimer's patients; or (c) identifying the sequence of interesting genes after bulk sequencing of the genome. Examples of current methods are described by Lan and Reeves (2000) Trends in Microbiology 8:396-401. This review describes many ways in which one may compare two related species to identify differences. Included are descriptions of differential hybridization, use of microarrays, and polymorphism mapping. The use of microarrays allows comparison of several species (the test strains) to a known, sequenced species (the reference strain). In order to perform this method, one must generate the entire DNA sequence of a genome (the reference genome), then synthesize oligonucleotides corresponding to much of the reference genome, and embed these oligonucleotides onto a matrix, such as a chip. This process is less desirable for rapid determination of plasmid-encoded genes of commercial value than the proposed method for several reasons. One drawback is that one must have the DNA sequence of a closely related reference strain, and one must synthesize chips containing many oligonucleotides. Further, one can only identify regions of similarity; regions of non-similarity must be inferred. Furthermore, this method does not provide a way to determine the DNA sequence of the variant regions present in the test strain.

[0084] Polymorphism mapping, for example by digestion with rare restriction enzymes and separation of the resulting fragments on pulsed field (PFGE) or field inversion gels (FIGE), can be used to screen related strains to determine the relative level of relatedness, and to map regions that are dissimilar between strains. However, this method does not generate any sequence information about the novel regions present in strains, and does not identify genes of commercial value. Differential hybridization techniques have the ability to identify regions of difference between strains, and to identify clones likely to contain differences. However, differential hybridization techniques are well known for their technical difficulty. Furthermore, the presence of repetitive DNA elements in genomes can substantially interfere with this method. Differential hybridization techniques based on hybridization of bulk PCR reactions are somewhat more technically feasible. However, none of these techniques has been used for rapid testing of plasmid sequences.

[0085] Sequences of several bacterial plasmids have been obtained, increasingly in the course of genome sequencing projects. A listing of bacterial plasmids sequenced to date is currently maintained by the National Center for Biotechnology Information (NCBI) and can be referenced at the NCBI website

[0086] (www.ncbi.nlm.nih.gov/PMGifs/Genomes/eub_p.html). One can readily see that large bacterial plasmids are relatively common in bacteria, and are likely to be present in a great many strains.

TABLE 1 Sequencing plasmid genomes vs. microbial genomes

| Genome | Size of genome | Fold coverage needed | Bp to sequence | Relative efficiency of new method |
|---|---|---|---|---|
| Large/complex plasmid genome | 1 × 10⁶ | 8 | 8 × 10⁶ | 5-fold |
| Small/less complex plasmid genome | 2 × 10⁵ | 8 | 1.6 × 10⁶ | 25-fold |
| Bacteria genome | 5 × 10⁶ | 8 | 4 × 10⁷ | - |
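The arithmetic behind Table 1 is a product and a ratio: bases to sequence equal genome size times fold coverage, and relative efficiency is the bacterial-genome figure divided by the plasmid-genome figure. A minimal Python sketch, with hypothetical names and the table's values as inputs:

```python
COVERAGE = 8  # fold coverage used in Table 1

GENOME_SIZES_BP = {
    "large/complex plasmid genome": 1_000_000,
    "small/less complex plasmid genome": 200_000,
    "bacterial genome": 5_000_000,
}


def bp_to_sequence(genome_bp, coverage=COVERAGE):
    """Total bases that must be read to reach the stated fold coverage."""
    return genome_bp * coverage


def relative_efficiency(plasmid_bp, bacterial_bp=GENOME_SIZES_BP["bacterial genome"]):
    """Ratio of sequencing effort: whole bacterial genome vs plasmid genome."""
    return bp_to_sequence(bacterial_bp) / bp_to_sequence(plasmid_bp)


for name, size in GENOME_SIZES_BP.items():
    print(name, bp_to_sequence(size), relative_efficiency(size))
```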

[0087] TABLE 2 Calculation of clones needed to cover plasmid genomes

| Average clone size (bp) | Approx. size of plasmid genome (bp); assuming 10 plasmids of 100 kb size | Fraction of genome per clone | Number of clones needed to represent plasmid genome (95% confidence)* | Number of sequencing reactions to sample genome |
|---|---|---|---|---|
| 20,000 | 1 × 10⁶ | 2 × 10⁻² | 148 | 296 |
| 15,000 | 1 × 10⁶ | 1.5 × 10⁻² | 198 | 396 |
| 10,000 | 1 × 10⁶ | 1.0 × 10⁻² | 298 | 596 |
| 5,000 | 1 × 10⁶ | 5.0 × 10⁻³ | 597 | 1194 |
| 1,000 | 1 × 10⁶ | 1.0 × 10⁻³ | 2994 | 5988 |
| 800 | 1 × 10⁶ | 8.0 × 10⁻⁴ | 3745 | 7490 |
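The clone counts in Table 2 are consistent with the standard library-coverage estimate N = ln(1 - P) / ln(1 - f), where P is the desired probability of representation and f is the fraction of the genome per clone. The source does not state its formula, so the Python sketch below assumes this one; the function names are hypothetical. Truncating the result reproduces the tabulated values for inserts of 1,000 to 20,000 bp, while for the 800 bp row the formula gives about 3743 against the tabulated 3745.

```python
import math


def clones_needed(insert_bp, genome_bp=1_000_000, confidence=0.95):
    """Estimated number of random clones for the given probability of
    representing any one sequence (assumed formula, see lead-in)."""
    fraction = insert_bp / genome_bp
    return math.log(1 - confidence) / math.log(1 - fraction)


def reactions_to_sample(insert_bp, **kwargs):
    """Two end reads per clone, as in the last column of Table 2."""
    return 2 * int(clones_needed(insert_bp, **kwargs))


for size in (20_000, 15_000, 10_000, 5_000, 1_000):
    print(size, int(clones_needed(size)), reactions_to_sample(size))
```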

[0088] TABLE 3 Estimates of advantage of novel system over current methods

| Average clone size (bp) | Number of clones needed to represent plasmid genome (95% confidence) | Number of sequencing reactions to sample genome | Percent of clones detected as novel | Clones to sequence | Number of sequencing reactions to finish sequence of each novel clone (est. 1.5/kb) | Estimated seq reactions per genome | Efficiency vs sequence of random 800 bp inserts (7490 reads; see Table 2), shown as % of reactions | Fold improvement |
|---|---|---|---|---|---|---|---|---|
| 20,000 | 148 | 296 | 50 | 74 | 30*74 = 2220 | 2220 + 296 = 2516 | 33% | 3-fold |
| | | | 10 | 15 | 30*15 = 450 | 450 + 296 = 746 | 10% | 10-fold |
| | | | 1 | 2 | 30*2 = 60 | 60 + 296 = 356 | 4.7% | 20-fold |
| 15,000 | 198 | 396 | 50 | 99 | 22.5*99 = 2228 | 2228 + 396 = 2624 | 35% | 3-fold |
| | | | 10 | 20 | 22.5*20 = 450 | 450 + 396 = 846 | 11% | 8-fold |
| | | | 1 | 2 | 22.5*2 = 45 | 45 + 396 = 441 | 5.9% | 16-fold |
| 10,000 | 298 | 596 | 50 | 149 | 15*149 = 2235 | 2235 + 596 = 2831 | 38% | 2.5-fold |
| | | | 10 | 30 | 15*30 = 450 | 450 + 596 = 1046 | 14% | 7-fold |
| | | | 1 | 3 | 15*3 = 45 | 45 + 596 = 641 | 8.5% | 12-fold |
| 5,000 | 597 | 1194 | 50 | 597 | 7.5*597 = 4477 | 4477 + 1194 = 5671 | 76% | 1.3-fold |
| | | | 10 | 119 | 7.5*119 = 893 | 893 + 1194 = 2087 | 28% | 3.6-fold |
| | | | 1 | 12 | 7.5*12 = 90 | 90 + 1194 = 1284 | 17% | 5.8-fold |
| 1,000 | 2994 | 5988 | 50 | 2994 | 1.5*2994 = 4491 | 4491 + 5988 = 10479 | 139% | 0 |
| | | | 10 | 599 | 1.5*599 = 899 | 899 + 5988 = 6887 | 92% | 0 |
| | | | 1 | 60 | 1.5*60 = 90 | 90 + 5988 = 6078 | 81% | 1.2-fold |
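The per-row calculation in Table 3 can be sketched in Python as follows. The sketch assumes two end reads per clone for sampling, rounds the count of novel clones up, and uses the table's 1.5 finishing reads per kb; the function names are hypothetical. It is checked only against the 20,000 bp and 10,000 bp rows, since the rows for the smallest inserts appear to be computed on a different basis and are not reproduced.

```python
import math

RANDOM_800BP_READS = 7490    # Table 2, 800 bp row
FINISH_READS_PER_KB = 1.5    # estimate stated in the Table 3 heading


def reactions_per_genome(insert_bp, clones, pct_novel):
    """Sampling reads plus finishing reads for the clones flagged as novel."""
    sampling = 2 * clones                           # one read from each end
    novel = math.ceil(clones * pct_novel / 100)     # assumption: round up
    finishing = FINISH_READS_PER_KB * insert_bp / 1000 * novel
    return sampling + finishing


def fold_improvement(total_reactions):
    """Effort relative to sequencing random 800 bp inserts."""
    return RANDOM_800BP_READS / total_reactions


for pct in (50, 10, 1):
    total = reactions_per_genome(20_000, 148, pct)
    print(pct, total, round(fold_improvement(total), 1))
```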

[0089] The following examples are offered by way of illustration and not by way of limitation.

EXPERIMENTAL

Example 1 Rapid Capture of Diversity From Bacillus thuringiensis Strains

[0090] The following is an example of how one might practice the invention in the case of a strain or strains where there is likely to be little redundancy with previously known sequences:

[0091] Purification of Episomal DNA From a Bacillus Culture

[0092] To clone and sequence plasmid DNA from a Bacillus strain, one first needs to prepare purified plasmid DNA. Ideally, one will purify 100 ug or more of plasmid DNA. A starter culture of the Bacillus strain should be grown in 5 ml of LB overnight at 37° C. with aeration. This 5 ml culture is then used to inoculate a 100-250 ml culture which should be grown for 8 hours at 37° C. with aeration. The young cells in this culture will be easier to lyse. The cells can be harvested at 6000 rpm in a Sorvall SS34 rotor for 15 minutes. The cell pellet should be resuspended in 20 ml STE (10 mM Tris pH 8, 0.1 M NaCl, 1 mM EDTA pH 8) to remove all media and then centrifuged again at 6000 rpm for 15 minutes. After removing all traces of the STE, the cell pellets may be frozen overnight. The thawed pellets should be resuspended by vortexing in an appropriate amount of 50 mM Tris pH 8.0, 10 mM EDTA pH 8, 50 mM glucose and 100 ug/ml RNaseA. Use 5 ml of this buffer for every 100 ml of cell culture. A large amount of powdered lysozyme should be added to the resuspended cells. Incubating the cells at 37° C. for at least one hour helps improve cell lysis. After incubation with lysozyme the cells are lysed by alkaline lysis. 15 ml of 200 mM sodium hydroxide and 1% SDS should be added per 100 ml of cell culture to ensure complete lysis. Mix by inversion and incubate at room temperature for 5 minutes. 15 ml of 3 M potassium acetate pH 5.5 should be added per 100 ml of cell culture and mixed by inversion. The precipitate should be removed by centrifugation at 13,000 rpm for 30 minutes in the Sorvall SS34. The supernatant should be filtered through a piece of Whatman paper pre-wetted with dH₂O. A Qiatip-500 column should be equilibrated with 10 ml Buffer QBT. The filtered supernatant should be applied to the column and the flow-through discarded. The column should be washed twice with 30 ml Buffer QC. The DNA should be eluted from the column with 15 ml of Buffer QF that has been warmed to 65° C. 10.5 ml of isopropanol should be added to the eluted DNA. The DNA is precipitated overnight at −20° C., and the precipitated DNA is centrifuged at 13,000 rpm in the Sorvall SS34 rotor for 45 minutes. The supernatant is removed and the pellet is washed in 10 ml 70% ethanol. Centrifuge 30 minutes at 13,000 rpm. The pellets are dried at room temperature, then resuspended in 1 ml of TE (10 mM Tris, 1 mM EDTA, pH 8.0). Resuspend the pellet overnight at 4° C. to ensure dissolution of the plasmid DNA. Check for the presence of plasmid DNA by electrophoresing 10 ul of the plasmid DNA on a 0.5% agarose gel (pulse-field grade agarose) in 1× TAE at 1.5-2 V/cm.
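Three of the solutions in this protocol are specified per 100 ml of culture, so their volumes scale with the 100-250 ml culture actually grown. A small Python helper, with hypothetical names, makes the scaling explicit; the per-100 ml figures are those given in the protocol above.

```python
# Volumes (ml) stated per 100 ml of cell culture.
ML_PER_100_ML_CULTURE = {
    "resuspension buffer (Tris/EDTA/glucose/RNaseA)": 5.0,
    "lysis solution (200 mM NaOH, 1% SDS)": 15.0,
    "3 M potassium acetate pH 5.5": 15.0,
}


def reagent_volumes_ml(culture_ml):
    """Scale the per-100 ml reagent volumes to the given culture volume."""
    scale = culture_ml / 100.0
    return {name: ml * scale for name, ml in ML_PER_100_ML_CULTURE.items()}


for name, ml in reagent_volumes_ml(250).items():
    print(f"{name}: {ml} ml")
```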

[0093] Phase I Screening of Clones, and Dataset Buildup

[0094] A 100 ug aliquot of plasmid DNA is added to nebulizing buffer, 50% glycerol and TM buffer (50 mM Tris, 8.1 mM MgSO4, pH 7.5), to a volume of 2 ml. The solution is added to the bottom of a nebulizer and incubated for 10 minutes in an ethanol-dry ice mixture. The nebulizer is connected to a nitrogen tank and pressure is applied to the sample in a range of 8 to 12 psi, varying from sample to sample. The sheared DNA is then divided into 8 portions and ethanol precipitated. The DNA is then resuspended in TE and end repaired using T4 polymerase, Klenow and T4 polynucleotide kinase. The end repaired DNA is then electrophoresed for size separation on a 1% low melt agarose gel at 75 V for 2.5 hours. The DNA of desired size is excised from the gel, extracted from the agarose using the QiaQuick® Gel Extraction kit (Qiagen), and subsequently concentrated by ethanol precipitation. The DNA is checked for quality and quantity on a 1% agarose gel run at 100 V for 1 hour.

[0095] Fragmented, end-repaired, purified DNA is ligated into a suitable vector. For example, pBluescript™ (Stratagene) or pZero-2™ (Invitrogen) can be prepared by digesting with an enzyme generating a blunt end (e.g. EcoRV). The terminal phosphates on the ends of the vector may be removed with calf intestinal phosphatase to reduce background colonies resulting from religation of vector. The ligations are performed at 12° C. overnight and heat inactivated at 70° C. for 25 minutes. Alternatively, ligations are performed with an overnight incubation at 25° C. Transformations are performed by adding 1 ul of the ligation mix to an aliquot of 30 ul of DH10B cells. The cell/DNA mixture is transferred to a cuvette that has been incubated on ice for 10 minutes. The cuvette is placed in the BioRad electroporator and given a pulse of 1700 V for 5 ms. SOC is added at a volume of 1 ml to the cuvettes to recover the cells. The cells are transferred to culture tubes and incubated at 37° C. for 1 hour. The transformations are plated onto LB agar containing the appropriate antibiotic.

[0096] The colonies are picked into 96 well growth blocks containing 800 ul of Terrific Broth with antibiotic. The blocks are covered with Qiagen Airpore tape and grown overnight at 37° C. with shaking. Glycerol stocks of the growth are prepared by taking 20 ul of the culture and adding it to 20 ul of 40% glycerol. These are stored at −80° C. The 96 well cultures are centrifuged at 4000 rpm for 10 minutes in a refrigerated tabletop centrifuge.

[0097] Clone preparation is carried out in 96 well blocks using an alkaline lysis protocol with a Whatman 96 well filter plate for lysate clearing. The DNA is then precipitated, resuspended in water, and run on a 0.8% agarose gel for quantification. The sequencing reactions are performed by cycle sequencing using Applied Biosystems Big Dye Terminator kits and MJ tetrad thermocyclers. The reaction is precipitated and run on an ABI 3700 capillary sequencer for analysis.

[0098] Sequences resulting from reactions run through the ABI sequencer are transferred to a Sun workstation running a UNIX® operating system. The sequences are checked for quality score, trimmed to remove vector sequences, and assembled using the Phred/Phrap program suite. The sequences of all resulting contigs as well as all unassembled sequences are combined in a directory that acts as a database.

[0099] Phase II. Use of Dataset to Rapidly Screen for Novel Gene Regions and Capture Diversity.

[0100] In Phase II, libraries of closely related species (for example Bacillus thuringiensis kumamotoensis) or unknown strains verified to be related to Bacillus thuringiensis (e.g. by 16S rRNA sequence analysis or MIDI analysis of cell wall fatty acid composition) are generated as described in STEP 2 of the method. For example, one performs a partial digest of plasmid DNA with the enzyme Eco509I, which generates a 5′ overhang compatible with the restriction enzyme EcoRI. DNA fragments migrating at a size of 5-25 kb, more preferably 10-20 kb, or still more preferably 15 kb, are excised and ligated into a vector. One suitable vector would be pBluescript, digested with EcoRI. Alternatively, one may use a cloning vector such as pBeloBAC11 to accept large inserts. Alternatively, one may develop and utilize specialized vectors to allow generation of plasmid clones with inserts of 10-20 kb. In any case, the insert DNA is ligated to the vector, and the ligation transferred into E. coli using methods known in the art (e.g. Sambrook and Russell, supra) or by following manufacturer's instructions.

[0101] Regardless of the vector, clones from the resulting library are picked at random and grown in 96 well format, and DNA is prepared as described above. Sequencing reactions are performed on one or both sides of the clone, and the resulting sequences are tested against the existing database of plasmid sequences (from a Phase I project(s)). Clones having at least one unique end sequence are identified for further processing. This DNA is then digested with one of two restriction enzymes that flank the insert but are likely to occur rarely (for example NotI or PmeI). After inactivating the restriction enzymes, the two digests are pooled. These pooled reactions are then mutagenized with Tn5 in vitro. One way to achieve this is by using a commercially available kit, for example the EZ::TN™ Insertion Kit (Epicentre). After mutagenizing and removing transposase (for example by phenol:chloroform extraction followed by ethanol precipitation), reactions are ligated with T4 DNA ligase and transferred into E. coli. Clones which receive a transposon insertion are identified by antibiotic resistance (the transposon encodes either kanamycin or tetracycline resistance). Antibiotic resistant clones are picked, and their end-sequences determined as described previously. One chooses a sufficient number of clones to adequately cover the sequence of the novel clones. The number one chooses depends on the size of the clones, and the number of reads per DNA length one desires. One may choose to determine high quality sequence for each nucleotide of each clone. Alternatively, one may not wish to determine the complete nucleotide sequence of each clone. It may be sufficient to sample the clone sequence such that one has a reasonable probability of identifying a commercially valuable gene.

Example 2 Capture of Episomal Diversity From Environmental Samples

[0102] In this example, one isolates plasmid DNA from a soil, water, or other type of environmental sample, and then generates and screens libraries by end sequencing to identify novel DNA regions. One may sequence either one or both ends of the resulting clones.

[0103] Plasmid DNA from soil, for example, is isolated by the procedure listed above, and further purified by cesium chloride centrifugation. Purified plasmid DNA is fragmented, and 10-20 kb fragments as well as other size fragments (1-3 kb, 3-10 kb and 10-25 kb) are isolated by agarose gel electrophoresis. Alternatively, one may use vectors that do not require gel purification of fragments to achieve size selection. Purified fragments are ligated to a vector or vectors of choice, and the resulting mixture transferred into E. coli. Individual colonies are picked, and DNA prepared for sequencing as above. Resulting sequence is tested for novelty against a database, and novel sequences are identified as described. Novel sequences are then added to the database.
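The screen-then-update cycle in the last two sentences can be sketched as a simple loop. This is only an illustration: the text tests novelty by database search, whereas the sketch substitutes the fraction of shared k-mers as a stand-in similarity measure. The function names, the k-mer size, and the 0.5 threshold are all assumptions.

```python
def kmers(seq: str, k: int = 8) -> set[str]:
    """All overlapping k-mers of a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_and_update(reads: dict[str, str], database: set[str],
                      k: int = 8, max_shared: float = 0.5) -> list[str]:
    """Return names of reads judged novel; the database grows as novel reads are found.

    A read is called novel when at most `max_shared` of its k-mers are already
    present in the database (a crude stand-in for a real similarity search).
    """
    novel = []
    for name, seq in reads.items():
        read_kmers = kmers(seq, k)
        if not read_kmers:
            continue  # read shorter than k, cannot be scored
        shared = len(read_kmers & database) / len(read_kmers)
        if shared <= max_shared:
            novel.append(name)
            database |= read_kmers  # novel sequence is added to the database
    return novel


if __name__ == "__main__":
    db: set[str] = set()
    reads = {
        "clone1_fwd": "ACGTACGTTAGCCGATAGGCTA",
        "clone2_fwd": "ACGTACGTTAGCCGATAGGCTA",   # duplicate of clone1_fwd
        "clone3_fwd": "TTTTGGGGCCCCAAAATTGGCCAA",
    }
    print(screen_and_update(reads, db))
```

Because the database is updated inside the loop, a sequence recovered a second time within the same library is no longer reported as novel.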

Example 3 Algorithm for Data Parsing

[0104] Algorithms are useful to sort data, and to manage large amounts of information. One possible algorithm that may be used to identify clones for further sequencing is described here. This type of algorithm can be particularly useful in cases where one has generated a large dataset of existing sequences (such as bacterial plasmid sequences), and wishes to sequence only clones that do not have identity or high similarity to members of the database.

[0105] Algorithm

[0106] 1. Assign a label to each clone

[0107] 2. Send sequences to pool ‘A’

[0108] 3. Pre-blast sequences in pool ‘A’ to remove/mask sequences that are repetitive in nature (e.g. transposon sequences or vector sequences). Send these sequences to pool ‘B’

[0109] 4. Blast search of n number sequences in pool ‘B’

[0110] 5. Place sequences in pools based on results of blast search of pool ‘B’

[0111] a. If e > 10^-1, then send to pool ‘Failblast’

[0112] b. If e < 10^-1, then send to pool ‘C’

[0113] c. Of clones in pool ‘C’, if score < 10^-10, send to pool ‘D’. If > 10^-10, then send to pool ‘Failblast-10’

[0114] d. Of clones in pool ‘D’, if score < 10^-50, send to pool ‘E’. If > 10^-50, then send to pool ‘Failblast-50’

[0115] e. Of clones in pool ‘E’, if score < 10^-100, send to pool ‘F’. If > 10^-100, then send to pool ‘Failblast-100’

[0116] f. Of clones in pool ‘F’, if score = 0.0, send to pool ‘Identical’. If score is not 0.0, send to ‘Failblast-not identical’

[0117] 6. Set clones into pools based on cumulative results of blast of both (or multiple) end sequences.

[0118] For each sequence in pool ‘Failblast’, does the sequence have a partner sequence in pool ‘B’? If so, sort based on homology of both.

[0119] a. If sequence in pool ‘Failblast’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘B-9’

[0120] b. If the sequence does have a partner sequence,

[0121] c. If the partner sequence is in pool ‘Failblast’, then place the clone in clonepool ‘B-1’.

[0122] d. If the partner sequence is in pool ‘Failblast-10’, then place the clone in clonepool ‘B-2’.

[0123] e. If the partner sequence is in pool ‘Failblast-50’, then place the clone in clonepool ‘B-3’

[0124] f. If the partner sequence is in pool ‘Failblast-100’, then place the clone in clonepool ‘B-4’

[0125] g. If the partner sequence is in pool ‘Failblast-not identical’, then place the clone in clonepool ‘B-5’.

[0126] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘B-6’.

[0127] Repeat Operation 1 for Each Sequence in Pool Failblast-10

[0128] a. If sequence in pool ‘Failblast-10’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘C-9’

[0129] b. If the sequence does have a partner sequence,

[0130] c. If the partner sequence is in pool ‘Failblast’, then ignore the clone (since it should already be in clonepool ‘B-2’).

[0131] d. If the partner sequence is in pool ‘Failblast-10’, then place the clone in clonepool ‘C-2’.

[0132] e. If the partner sequence is in pool ‘Failblast-50’, then place the clone in clonepool ‘C-3’

[0133] f. If the partner sequence is in pool ‘Failblast-100’, then place the clone in clonepool ‘C-4’

[0134] g. If the partner sequence is in pool ‘Failblast-not identical’, then place the clone in clonepool ‘C-5’.

[0135] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘C-6’.

[0136] Repeat Operation 1 for Sequences in Pool Failblast-50

[0137] a. If sequence in pool ‘Failblast-50’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘D-9’

[0138] b. If the sequence does have a partner sequence,

[0139] c. If the partner sequence is in pool ‘Failblast’, then ignore the clone (since it should already be in clonepool ‘B-3’).

[0140] d. If the partner sequence is in pool ‘Failblast-10’, then ignore the clone (since it should already be in clonepool ‘C-3’).

[0141] e. If the partner sequence is in pool ‘Failblast-50’, then place the clone in clonepool ‘D-3’

[0142] f. If the partner sequence is in pool ‘Failblast-100’, then place the clone in clonepool ‘D-4’

[0143] g. If the partner sequence is in pool ‘Failblast-not identical’, then place the clone in clonepool ‘D-5’.

[0144] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘D-6’.

[0145] Repeat Operation 1 for Sequences in Pool Failblast-100

[0146] a. If sequence in pool ‘Failblast-100’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘E-9’

[0147] b. If the sequence does have a partner sequence,

[0148] c. If the partner sequence is in pool ‘Failblast’, then ignore the clone (since it should already be in clonepool ‘B-4’).

[0149] d. If the partner sequence is in pool ‘Failblast-10’, then ignore the clone (since it should already be in clonepool ‘C-4’).

[0150] e. If the partner sequence is in pool ‘Failblast-50’, then ignore the clone (since it should already be in clonepool ‘D-4’).

[0151] f. If the partner sequence is in pool ‘Failblast-100’, then place the clone in clonepool ‘E-4’

[0152] g. If the partner sequence is in pool ‘Failblast-not identical’, then place the clone in clonepool ‘E-5’.

[0153] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘E-6’.

[0154] Repeat Operation 1 for Sequences in Pool Failblast-not Identical

[0155] a. If sequence in pool ‘Failblast-not identical’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘F-9’

[0156] b. If the sequence does have a partner sequence,

[0157] c. If the partner sequence is in pool ‘Failblast’, then ignore the clone (since it should already be in clonepool ‘B-5’).

[0158] d. If the partner sequence is in pool ‘Failblast-10’, then ignore the clone (since it should already be in clonepool ‘C-5’).

[0159] e. If the partner sequence is in pool ‘Failblast-50’, then ignore the clone (since it should already be in clonepool ‘D-5’).

[0160] f. If the partner sequence is in pool ‘Failblast-100’, then ignore the clone (since it should already be in clonepool ‘E-5’).

[0161] g. If the partner sequence is in pool ‘Failblast-not identical’, then place the clone in clonepool ‘F-5’.

[0162] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘F-6’.

[0163] Repeat Operation 1 for Sequences in Pool Identical

[0164] a. If sequence in pool ‘Identical’ does not have a partner sequence in pool ‘B’ then send the clone to clonepool ‘G-9’

[0165] b. If the sequence does have a partner sequence,

[0166] c. If the partner sequence is in pool ‘Failblast’, then ignore the clone (since it should already be in clonepool ‘B-6’).

[0167] d. If the partner sequence is in pool ‘Failblast-10’, then ignore the clone (since it should already be in clonepool ‘C-6’).

[0168] e. If the partner sequence is in pool ‘Failblast-50’, then ignore the clone (since it should already be in clonepool ‘D-6’).

[0169] f. If the partner sequence is in pool ‘Failblast-100’, then ignore the clone (since it should already be in clonepool ‘E-6’).

[0170] g. If the partner sequence is in pool ‘Failblast-not identical’, then ignore the clone (since it should already be in clonepool ‘F-6’).

[0171] h. If the partner sequence is in pool ‘Identical’, then place the clone in clonepool ‘G-6’.

[0172] 7. Report generation and parsed files.

[0173] One can combine clonepools based on the desired set for analysis. For example, to receive only the most unique clones, output could contain clonepools B-1, B-2, B-3, B-9, C-2, C-3 and D-4. For example, a printout is created of all members starting with pool B-1, and progressing to pool G-6. Parsing can be a simple command such as “copy all files with sequence in clone pools B, C, D to directory ‘Novel sequences-date’”, wherein the directory is created, and sequences passing the test are copied to the new directory. Similarly, non-novel sequences can be parsed to a different directory, for example “previously identified.” Alternatively, the clone pools passing the criteria may be sent to other programs that further process the information. For example, one may wish to search sequences for those with some homology (but not identity) to known genes of interest. One may accomplish this by, for example, testing clonepools in searches that involve hypothetical translation of the DNA sequence, typically in all 6 possible reading frames.
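Steps 5 through 7 can be condensed into a short sketch. It is one possible reading of the algorithm, not a definitive implementation. It assumes the "score" in step 5 is the best-hit E-value of each end sequence, that a read with no hit is given an infinite E-value, and that an E-value exactly equal to a cutoff passes to the next test (the text does not say). Each clone is assigned once, from its less similar end, which yields the same clonepool names as the pairwise rules of step 6. The single-read pool for ‘Failblast-not identical’ is named F-9 by analogy with the other pools. All function and variable names are hypothetical.

```python
# Step 5: cutoffs are tested in order; the first one exceeded names the pool.
CUTOFFS = [
    (1e-1, "Failblast"),
    (1e-10, "Failblast-10"),
    (1e-50, "Failblast-50"),
    (1e-100, "Failblast-100"),
]
# Pools ordered from least to most similar to the database.
POOL_ORDER = ["Failblast", "Failblast-10", "Failblast-50",
              "Failblast-100", "Failblast-not identical", "Identical"]
LETTERS = "BCDEFG"  # clonepool letter for each entry of POOL_ORDER


def sequence_pool(evalue: float) -> str:
    """Step 5: place one end sequence in a pool from its best-hit E-value."""
    for cutoff, pool in CUTOFFS:
        if evalue > cutoff:
            return pool
    return "Identical" if evalue == 0.0 else "Failblast-not identical"


def clone_pool(end_evalues: list[float]) -> str:
    """Step 6: combine one or two end sequences into a clonepool label."""
    if not 1 <= len(end_evalues) <= 2:
        raise ValueError("this sketch handles one or two end sequences per clone")
    ranks = sorted(POOL_ORDER.index(sequence_pool(e)) for e in end_evalues)
    if len(ranks) == 1:
        return f"{LETTERS[ranks[0]]}-9"           # no partner sequence
    return f"{LETTERS[ranks[0]]}-{ranks[1] + 1}"  # letter from the less similar end


def select_clones(best_evalues: dict[str, list[float]],
                  wanted: set[str]) -> dict[str, str]:
    """Step 7: keep only clones whose clonepool is in the requested set."""
    pools = {clone: clone_pool(ev) for clone, ev in best_evalues.items()}
    return {clone: pool for clone, pool in pools.items() if pool in wanted}


# The "most unique" selection given as an example in the text.
MOST_UNIQUE = {"B-1", "B-2", "B-3", "B-9", "C-2", "C-3", "D-4"}

if __name__ == "__main__":
    hits = {
        "clone_001": [5.0, 1e-5],      # one end unmatched, one weak hit
        "clone_002": [0.0, 0.0],       # both ends identical to database entries
        "clone_003": [float("inf")],   # single read, no hit
    }
    print(select_clones(hits, MOST_UNIQUE))
```

Sorting the two ranks before labelling is what makes the "ignore the clone, since it should already be in clonepool ..." rules unnecessary in code: each clone is visited once rather than once per end sequence.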

Example 4 Identification of Novel Endotoxin Genes

[0174] Plasmid DNA from strain ATX13026 was prepared by growing and harvesting the cells in a large culture. The plasmid DNA was extracted by treatment of cell pellet with 4% SDS for 30 minutes, neutralization with Tris, and a subsequent incubation with 20 mM NaCl on ice. The DNA was precipitated by isopropanol precipitation and then further purified by CsCl centrifugation. Purified plasmid DNA was sheared by passage through a nebulizer (Invitrogen, Catalog no. K7025-05) using 8 psi for 2.5 minutes. Sheared DNA was separated by size by electrophoresis on an agarose gel, and DNA of the appropriate size excised, and purified by methods known in the art. The 5′ and 3′ termini of the purified fragments were converted to blunt ends using a treatment with T4 DNA polymerase and Klenow large fragment at 25° C. in the presence of all 4 dNTPs, followed by incubation with T4 polynucleotide kinase at 37° C. The blunt end fragments were then ligated into a vector, and transformed into E. coli. Individual clones were picked into wells of 96 well plates, and grown to saturation at 37° C. Plasmid DNA was prepared from these cells by methods known in the art, and the DNA sequences of the ends of 10,000 clones were obtained. Sequence files from a number of sequencing reactions were analyzed by the Phred/Phrap/Consed suite of programs. Contigs resulting from this analysis were then tested for presence of novel endotoxins by comparing the sequences against a database of known endotoxin proteins using the BLASTX algorithm.

TABLE 4 Novel endotoxin-containing clones identified by the method

Clone     Amino acid homology to endotoxin
pAX006    33% cry4Aa
pAX007    36% cry4Aa
pAX008    67% cry40Aa1
pAX009    34% cry8Ba
pAX010    35% cry36Aa1
pAX014    55% cry40Aa1

[0175] Using this sampling, the clones containing homologies to endotoxins were identified and sequenced in the regions predicted to contain endotoxin genes. Sequence analysis of the open reading frames obtained by this sequencing identified novel endotoxin genes. The genes identified by this method are not likely to hybridize to the set of known genes, due to the low level of amino acid and DNA homology between these genes and known genes.
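A comparison of DNA contigs against a protein database depends on translating each nucleotide sequence in all six reading frames. The sketch below illustrates only that translation step, in plain Python and with the standard genetic code assumed; it does not reproduce the BLASTX search itself, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict[str, str]:
    """Translations of frames +1, +2, +3 and -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translations would then be compared against the protein database, so that a gene can be detected regardless of its strand or frame within the clone.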

Example 5 Identification of a Novel Cellulase

[0176] A database of cellulases, xylanases and other lignocellulose degrading enzymes was created from existing known amino acid sequences. The database of end sequences from strain ATX13026 was tested for presence of lignocellulose degrading enzymes. Clone pAXE001 was found to have strong homology to a known cellulase.

TABLE 5 Novel cellulase identified by the method

Clone      Amino acid homology to cellulase
pAXE001    84% to cellulase, GenBank accession number A44808

Example 6 Construction of miniCos-I

[0177] First, Supercos (Stratagene) was linearized with EcoRI, and the 5′ overhangs filled by incubation with Klenow and dNTPs as known in the art (Sambrook and Russell, supra). The linearized vector was then digested with HpaI. The 5.5 kilobase fragment containing the COS sites, kanamycin resistance gene, and the SV40 replication origin was purified by agarose gel electrophoresis.

[0178] Oligonucleotides were designed such that a PCR reaction using oligo 1 and 2 amplified a portion of Supercos containing the origin of replication and ampicillin resistance gene. Oligo 1 incorporated a single lox site and a SwaI site, oriented such that the PCR product contains a lox site internal to a SwaI site. Oligo 2 incorporated a novel multiple cloning site. Using Oligos 1 and 2, a PCR product was generated from Supercos. The PCR product was gel purified, and subjected to a second PCR reaction with oligonucleotide 1 and oligonucleotide 3. Oligo 3 was designed such that it overlapped Oligo 2, and incorporated a lox recombinase site, as well as a SwaI restriction site, into the PCR product, oriented as for Oligo 1. The 3′ single stranded nucleotides generated by the polymerase were removed by incubation with Klenow fragment of DNA polymerase and dNTPs, and 5′ phosphates added by incubation with T4 DNA polynucleotide kinase and ATP as known in the art.

[0179] pRE112 from ATCC 87692 (Edwards et al. (1998) Gene 207:149-157) was digested with EcoRI, and the fragment containing the sacB gene isolated by agarose gel electrophoresis, and the 5′ overhangs filled by incubation with Klenow and dNTPs as known in the art (Sambrook and Russell, supra).

[0180] The blunt ended PCR product was ligated to the 5.5 kb fragment of Supercos, transformed into E. coli, and DNA of the correct constructs (referred to herein as Tempclone#1) was verified by restriction digestion and DNA sequencing. Tempclone#1 was then digested with SmaI, treated with calf intestinal phosphatase, and ligated to the sacB fragment from pRE112. Clones containing the correct ligation products were identified as known in the art. The presence of the kanamycin resistance, ampicillin resistance, and sacB markers was confirmed by testing in E. coli, and a positive clone, referred to herein as Tempclone#2, was identified.

[0181] Tempclone#2 was digested with AccIII, and ligated to a DNA linker designed to incorporate restriction sites for the enzymes ApaI and BsiWI into Tempclone#2. This yielded Tempclone#3.

[0182] By analyzing the DNA sequence of lambda phage, a DNA region approximately 9 kb in size was identified that lacked restriction sites for XbaI, SwaI, NotI, and all other enzymes in the multiple cloning site. Lambda DNA (New England Biolabs) was digested with ApaI and BsiWI, and the 9 kb fragment was isolated.
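The search for a stuffer region free of particular restriction sites can be sketched as a scan over the sequence. This is an illustration only, not the analysis actually performed. It lists just the three enzymes named above, since the remaining enzymes of the multiple cloning site are not specified here; their recognition sequences are palindromic, so scanning one strand is sufficient. The function name and the toy sequence are assumptions.

```python
import re

# Recognition sequences of the three enzymes named in the text.
SITES = {"XbaI": "TCTAGA", "SwaI": "ATTTAAAT", "NotI": "GCGGCCGC"}


def longest_site_free_region(seq: str, sites: dict[str, str]) -> tuple[int, int]:
    """Return (start, end), 0-based and half-open, of the longest stretch that
    lies entirely between occurrences of the given recognition sequences."""
    seq = seq.upper()
    # A lookahead is used so that overlapping occurrences are all reported.
    alternatives = "|".join(re.escape(s.upper()) for s in sites.values())
    pattern = re.compile(f"(?=({alternatives}))")
    best = (0, 0)
    prev_end = 0
    for match in pattern.finditer(seq):
        start, end = match.start(1), match.end(1)
        if start - prev_end > best[1] - best[0]:
            best = (prev_end, start)
        prev_end = max(prev_end, end)
    if len(seq) - prev_end > best[1] - best[0]:
        best = (prev_end, len(seq))  # tail after the last site
    return best


if __name__ == "__main__":
    toy = "AAAA" + "TCTAGA" + "C" * 20 + "GCGGCCGC" + "GG"
    start, end = longest_site_free_region(toy, SITES)
    print(start, end, end - start)
```

Applied to a full phage sequence, the returned interval would still need to be checked for flanking sites suitable for excision, such as the ApaI and BsiWI sites used here.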

[0183] Tempclone#3 was digested with ApaI and BsiWI, ligated to the 9 kb lambda fragment, and transformed into E. coli. Clones containing the lambda insert were confirmed by restriction digest. The final clone is referred to as miniCos-I.

[0184] All publications and patent applications mentioned in the specification are indicative of the level of skill of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.

[0185] Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious that certain changes and modifications may be practiced within the scope of the appended claims.

That which is claimed:
 1. A method for identifying a novel nucleotide sequence, comprising: a) generating a library comprising at least one extrachromosomal DNA clone, b) obtaining a sequence for a portion of said DNA clone, wherein the length of said sequence is less than one-third of the length of said clone; c) comparing said sequence against a database comprising existing DNA sequences; d) repeating steps a) through c) to generate a set of clonal sequences; e) parsing said set of clonal sequences using an algorithm that parses sequences based on the presence or absence of said clonal sequence in said database; and, f) identifying at least one novel nucleotide sequence.
 2. A method for identifying a novel nucleotide sequence, comprising: a) generating a library of extrachromosomal DNA clones, b) sequencing a portion of said DNA clones, wherein the length of each sequence generated is less than one-third of said clone length; c) comparing the sequences of said library against a database of existing DNA sequences; d) using an algorithm to select said novel nucleotide sequence based on the presence or absence of said portion in a database; and, e) identifying at least one novel nucleotide sequence.
 3. The method of claim 2, wherein said sequence is translated to obtain all possible amino acid sequences and wherein said amino acid sequences are compared to a protein database.
 4. The method of claim 2, wherein said novel nucleotide sequence shares less than 30% sequence homology with any sequence in said database.
 5. The method of claim 2, wherein said novel nucleotide sequence shares less than 60% sequence homology with any sequence in said database.
 6. The method of claim 2, wherein said novel nucleotide sequence shares less than 80% sequence homology with any sequence in said database.
 7. The method of claim 2, wherein said novel nucleotide sequence shares less than 90% sequence homology with any sequence in said database.
 8. The method of claim 2, wherein said extrachromosomal DNA clones within said library are about 10 to about 20 kb in size.
 9. The method of claim 8, wherein said extrachromosomal DNA clones within said library are about 15 kb in size.
 10. The method of claim 2, wherein said extrachromosomal DNA clones within said library are about 1 to about 5 kb in size.
 11. The method of claim 2, wherein said extrachromosomal DNA clones within said library are about 1.5 kb in size.
 12. The method of claim 2, further comprising a step of mutagenizing said selected clones.
 13. The method of claim 12, wherein said mutagenizing is accomplished using a transposable element.
 14. The method of claim 2, wherein step c) utilizes BLASTX.
 15. The method of claim 2, wherein said library is generated from bacteria.
 16. The method of claim 2, wherein said library is generated from an organism selected from the group consisting of Clostridia, Bacillus, Agrobacterium, and Rhizobium.
 17. The method of claim 16, wherein said organism is Bacillus.
 18. The method of claim 17, wherein said organism is Bacillus thuringiensis.
 19. The method of claim 2, wherein said library is generated from a fungus.
 20. The method of claim 2, wherein said novel nucleotide sequence encodes an insect control gene.
 21. The method of claim 20, wherein said insect control gene is a delta-endotoxin.
 22. The method of claim 2, wherein said novel nucleotide sequence encodes a lignocellulose-degrading enzyme.
 23. The method of claim 22, wherein said lignocellulose-degrading enzyme is a cellulase.
 24. The method of claim 2, wherein said library is generated using a vector comprising a stuffer fragment and at least one cos site.
 25. The method of claim 2, wherein said database is Genbank.
 26. The method of claim 2, wherein said database comprises only known endotoxin proteins.
 27. The method of claim 2, wherein said database comprises only known lignocellulose-degrading enzymes.